Abstract:

This  project  seeks  to  develop  new  learning  algorithms  specifically  tailored  to  be  efficient 
and  effective  in  learning  from  big  data.  It  aims  to  achieve  this  by  combining  generative 
and  discriminative  learning,  exploiting  the  capacity  of  generative  learning  to  efficiently 
extract  useful  summary  statistics  from  large  data  and  using  discriminative  learning  to 
meld  these  statistics  into  a  highly  accurate  classifier. 

We  have  developed  two  classes  of  learning  algorithm  that  utilize  this  overarching 
strategy.  The  first  uses  discriminative  learning  to  select  a  generative  model.  In  this 
case  the  generative  model  is  either  Averaged  n  Dependence  Estimators  (ANDE)  or  K- 
Dependence  Bayes  (KDB)  and  the  discriminative  technique  is  feature  selection.  We  have 
shown  that  very  effective  feature  selection  can  be  achieved  with  a  single  pass  through 
the  training  data  for  each  attribute  that  is  finally  selected.  We  have  demonstrated  two 
benefits  of  feature  selection.  First,  feature  selection  reduces  the  bias  of  the  classifier, 
which  typically  decreases  error  for  larger  data.  Second,  feature  selection  decreases  the 
memory  requirements  of  the  classifier,  allowing  higher  values  of  n  to  be  used,  further 
decreasing  bias  and  further  reducing  error  on  large  data. 

The second class of learning algorithm combines generatively and discriminatively
learned parameters. Weighting attributes to Alleviate Naive Bayes' Independence
Assumption (WANBIA) uses discriminative learning of weights in an otherwise generatively
learned naive Bayes classifier. WANBIA-C modifies the WANBIA model so that models
equivalent to Logistic Regression models can be learned. The resulting learner creates
classifiers of equivalent accuracy to those learned by Logistic Regression; sometimes
the classifiers are slightly better and sometimes slightly worse, with neither algorithm
having a systematic advantage. However, the learning process for WANBIA-C is
substantially faster than that for Logistic Regression, sometimes by orders of magnitude.
We have demonstrated that these techniques generalize to models that directly model
higher-order attribute interdependencies and developed a proof-of-concept system, WANJE.

We have also investigated the purely generative techniques of Subsumption Resolution
and Submodel Weighting in ANDE and demonstrated that they are effective
with higher orders of n and that they are complementary to one another. We intend
these techniques to be incorporated into future combined generative/discriminative
approaches.

In  the  process  of  creating  algorithms  for  learning  from  high-throughput  data  streams, 
we  have  developed  effective  algorithms  for  discretization  of  streaming  numeric  data. 

1  Introduction 

Machine learning algorithms can be categorized into two families, generative and
discriminative learners [12]. Generative learners seek to model the joint distribution between
the independent X variables and the dependent Y variable. They classify by calculating
the posterior probability P(Y | X) from the joint probability. In contrast, discriminative
learners seek to directly model the posterior probability, without seeking to model the
probabilities of the independent X variables.

The  ANDE  learners  that  we  have  developed  [19]  are  generative  learners  that  are 
parameterized using maximum likelihood estimates. Maximum likelihood parameterization
is extremely efficient and can be performed in a single out-of-core pass through the
training  data.  The  resulting  models  have  been  shown  to  be  highly  accurate  for  moderate 
quantities  of  data.  However,  discriminative  learning  is  better  able  to  model  the  desired 
posterior  distribution  when  sufficient  training  data  are  available. 

The current project seeks to combine

•  the efficient learning of valuable information in a generative
manner with maximum likelihood estimation; and

•  the capacity of discriminative learning to more accurately model the posterior
distribution when there are sufficient data to avoid overfitting.

We  are  pursuing  two  main  strategies.  The  first  uses  discriminative  learning  to  select 
between  alternative  generative  models  on  the  basis  of  discriminative  performance.  The 
second  uses  maximum  likelihood  parameterization  to  augment  discriminative  learning 
of  discriminative  models. 

A  key  theoretical  insight  that  drives  our  research  is  derived  from  the  bias/variance 
trade-off  [12].  Most  learning  algorithms  represent  a  trade-off  between  variance  that 
results  from  overfitting  the  data,  and  bias  that  results  from  failure  to  appropriately  fit 
the  data.  Low  variance  algorithms  achieve  relatively  low  error  on  small  quantities  of 
data,  while  low  bias  algorithms  achieve  relatively  low  error  on  larger  quantities  of  data. 
This is illustrated in Figure 1.
Figure 1: Learning algorithms that closely fit complex multivariate distributions
overfit small data, but achieve lower error on large data. (Horizontal axis: data quantity.)


2  Scalable  discriminative  extensions  to  ANDE 

In  this  line  of  research  we  seek  to  improve  the  scalability  of  the  Averaged  N-Dependence 
Estimators (ANDE) family of algorithms. These algorithms are attractive for learning
from large data because their training time is linear with respect to data quantity and
they  can  learn  out-of-core  in  a  single  pass  through  the  data.  Hence,  they  can  learn 
from  data  that  are  too  large  to  fit  into  RAM.  Further,  their  models  can  be  learned 
incrementally,  and  hence  they  are  suitable  for  streaming  data.  However,  for  large  data 
quantity  it  is  desirable  to  have  low  bias  [1,  2],  and  hence  it  is  desirable  to  have  ANDE 
with  higher  values  of  n.  This  is  problematic  because  ANDE  does  not  scale  well  to  high 
n. 


2.1  Background 

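Averaged N-Dependence Estimators (ANDE) classify using the model

$$
\hat{P}_{AnDE}(y, x) \propto
\begin{cases}
\sum_{s \in \binom{\mathcal{A}}{n}} \delta(x_s)\, \hat{P}(y, x_s)\, \hat{P}(x \mid y, x_s) \Big/ \sum_{s \in \binom{\mathcal{A}}{n}} \delta(x_s) & : \sum_{s \in \binom{\mathcal{A}}{n}} \delta(x_s) > 0 \\
\hat{P}_{A(n-1)DE}(y, x) & : \text{otherwise}
\end{cases}
\tag{1}
$$

The parameters of the model ($\hat{P}(y, x_s)$ and $\hat{P}(x \mid y, x_s)$) are learned by maximum
likelihood estimation in a single pass through the data that can be implemented
out-of-core.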
Table 1: Notation

$P(e)$                        the unconditioned probability of event e
$P(e \mid w)$                 the conditional probability of event e given event w
$\hat{P}(e)$                  an estimate of P(e)
$\hat{P}_{AODE}(e)$           an AODE estimate of P(e)
$\hat{P}_{AnDE}(e)$           an AnDE estimate of P(e)
$\hat{P}_{wAnDE}(e)$          a weighted AnDE estimate of P(e)
$\hat{P}_{AnDESR}(e)$         an AnDE with Subsumption Resolution estimate of P(e)
$\hat{P}_{wANJE}(e)$          a WANJE estimate of P(e)
$a$                           the number of attributes
$c_i$                         the ith class
$k$                           the number of classes
$t$                           the number of training examples in T
$v$                           the average number of values per attribute
$y$                           a value from the set of all classes $\{c_1, \ldots, c_k\}$
$T$                           a training sample of t classified objects
$x$                           an object
$x_i$                         the value of the ith attribute of $x = \langle x_1, \ldots, x_a \rangle$
$x_{\{i, j, \ldots, q\}}$     the subset of attribute values from x with the specified indices. For
                              example, $x_{\{2,3,5\}} = \langle x_2, x_3, x_5 \rangle$
$\binom{\mathcal{A}}{n}$      the set of all size-n subsets of $\{1, \ldots, a\}$
$\delta(x_\alpha)$            a function that is 1 if T contains an object with the value $x_\alpha$,
                              otherwise 0
$\pi_i$                       the parents of attribute i in a Bayesian Network Classifier
$\pi_Y$                       the parents of the class attribute Y in a Bayesian Network Classifier

Averaged One-Dependence Estimators (AODE) refers to ANDE with n = 1 [18].
ANDE models are extremely efficient to learn with low n. Increasing n reduces bias,
which supports more accurate learning from larger data. However, the training time
complexity of ANDE is $O\left(t \binom{a}{n+1}\right)$, as we need to update each entry for every combination of the
n + 1 attribute values for every instance. The time complexity of classifying a single
example is $O\left(k a \binom{a}{n}\right)$.
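To make the single-pass counting and the averaging over parent attributes concrete, the following is a minimal in-memory sketch of the n = 1 case (AODE). It is an illustration only, not the authors' implementation: the function names, the Laplace-style smoothing, and the omission of the $\delta(x_s)$ test and of the fallback to A(n-1)DE are all assumptions made for brevity.

```python
from collections import Counter


def train_aode(rows, labels):
    """Collect the joint counts AODE needs in one pass over the data."""
    class_parent = Counter()  # (y, i, x_i) -> count
    triple = Counter()        # (y, i, x_i, j, x_j) -> count
    values = {}               # attribute index -> set of observed values
    for x, y in zip(rows, labels):
        for i, xi in enumerate(x):
            class_parent[(y, i, xi)] += 1
            values.setdefault(i, set()).add(xi)
            for j, xj in enumerate(x):
                if j != i:
                    triple[(y, i, xi, j, xj)] += 1
    return {"cp": class_parent, "tr": triple, "values": values,
            "classes": sorted(set(labels)), "t": len(rows)}


def aode_posterior(model, x):
    """Average the one-dependence estimates over every parent attribute."""
    k = len(model["classes"])
    scores = {}
    for y in model["classes"]:
        total = 0.0
        for i, xi in enumerate(x):
            n_parent = model["cp"][(y, i, xi)]
            # Smoothed estimate of P(y, x_i) (smoothing is an assumption).
            p = (n_parent + 1.0) / (model["t"] + k * len(model["values"][i]))
            for j, xj in enumerate(x):
                if j != i:
                    n_joint = model["tr"][(y, i, xi, j, xj)]
                    # Smoothed estimate of P(x_j | y, x_i).
                    p *= (n_joint + 1.0) / (n_parent + len(model["values"][j]))
            total += p
        scores[y] = total
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}
```

The nested loop over attribute pairs in `train_aode` is where the $\binom{a}{n+1}$ factor in the training time arises for n = 1; an out-of-core version would stream `rows` from disk rather than hold them in memory.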

2.2  Weighting  and  Subsumption  Resolution 

The first study [26] took two techniques, subsumption resolution [28] and mutual
information weighting [9], that have been developed for AODE (ANDE with n = 1), assessed
their interaction, and assessed whether they are effective with higher values of n.

Subsumption resolution is an effective technique for rectifying a specific class of
extreme violations of the attribute independence assumption, those where $P(x_j \mid x_i) = 1.0$.
In this case $P(y \mid x) = P(y \mid x_{\{1 \ldots j-1,\, j+1 \ldots a\}})$ and hence all inaccuracies introduced
into $P(y \mid x)$ by this violation of the attribute independence assumption can be avoided
by dropping $x_j$ from (1). For example, when the attribute values include female and
pregnant only the latter should be used, when they include male and not-pregnant only
the former should be used, and when they include female and not-pregnant both should
be used. This requires, however, that one infer whether $P(x_j \mid x_i) = 1$ for each pair
of attribute values. In the current research we infer that $P(x_j \mid x_i) = 1.0$ if
$\#(x_i) = \#(x_i, x_j) > 100$, where $\#(x_i)$ is the count of the number of times attribute value $x_i$
occurs in the data and $\#(x_i, x_j)$ is the count of the number of times both $x_i$ and $x_j$
occur together in the data. To prevent both attribute values being deleted if they cover
exactly the same data, we delete the one with the higher index if $\#(x_i) = \#(x_j)$.

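$$
\hat{P}_{AnDESR}(y, x) \propto \hat{P}_{AnDE}\left(y,\; x_{\{i \in X \,:\, \neg\exists j \in X \,:\, \#(x_i) = \#(x_i, x_j) > 100 \,\wedge\, [\#(x_j) > \#(x_i) \,\vee\, j < i]\}}\right)
$$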

Subsumption resolution has been shown to be effective at reducing the bias of A1DE
[29, 28].
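The counting rule described above can be sketched as follows. This is a hedged illustration rather than the implementation used in the study: the function names and the in-memory `Counter` tables are assumptions, and the code follows the female/pregnant example in the text, dropping the value that is implied by another value (the generalization) and, when two values cover exactly the same data, the one with the higher index.

```python
from collections import Counter
from itertools import combinations


def count_values(rows):
    """Count single attribute values and co-occurring pairs in one pass."""
    single = Counter()  # (i, x_i) -> count
    pair = Counter()    # ((i, x_i), (j, x_j)) with i < j -> count
    for x in rows:
        items = list(enumerate(x))
        for item in items:
            single[item] += 1
        for a, b in combinations(items, 2):
            pair[(a, b)] += 1
    return single, pair


def retained_indices(x, single, pair, min_count=100):
    """Indices of the attribute values of x kept after subsumption resolution."""
    items = list(enumerate(x))
    drop = set()
    for a, b in combinations(items, 2):  # a always has the lower index
        n_ab = pair[(a, b)]
        if n_ab <= min_count:
            continue  # too little evidence to infer a generalization
        if single[a] == n_ab:
            # Every occurrence of a also has b, so b is implied by a.
            # This branch also covers identical coverage, where the
            # higher index (b) is the one dropped.
            drop.add(b[0])
        elif single[b] == n_ab:
            # Every occurrence of b also has a, so a is implied by b.
            drop.add(a[0])
    return [i for i, _ in items if i not in drop]
```

With counts gathered from data in which every pregnant case is female, `retained_indices` keeps only the pregnant value for a (female, pregnant) instance and keeps both values for (female, not-pregnant).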

Another approach to reducing bias in AnDE that has been shown to be effective for
A1DE [4, 9, 22] is to weight the sub-models, modifying (1) to


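$$
\hat{P}_{wAnDE}(y, x) \propto
\begin{cases}
\sum_{s \in \binom{\mathcal{A}}{n}} \delta(x_s)\, w_s\, \hat{P}(y, x_s) \prod_{i=1}^{a} \hat{P}(x_i \mid y, x_s) & : \sum_{s \in \binom{\mathcal{A}}{n}} \delta(x_s) > 0 \\
\hat{P}_{wA(n-1)DE}(y, x) & : \text{otherwise}
\end{cases}
$$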


WAODE [9] weights A1DE, where s is a single attribute value. It sets $w_s$ to the mutual
information [13] of the attribute with the class. WAODE is effective at reducing the
bias of A1DE with minimal computational overhead. We here generalize that strategy
to MI-weighted AnDE, using $w_s = MI(S, Y)$,

$$
MI(S, Y) = \sum_{y \in \mathcal{Y}} \sum_{x_s \in \mathcal{X}_s} P(x_s, y) \log \frac{P(x_s, y)}{P(x_s)\, P(y)}
\tag{2}
$$

where $\mathcal{Y}$ is the set of class labels and $\mathcal{X}_s$ is the cross product of values for attributes
with indices in s.
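As an illustration of Equation 2, the weight for a sub-model can be computed directly from frequency counts. The sketch below is an assumption-laden example (the function name and the use of unsmoothed relative frequencies are illustrative choices); each element of `xs` may be a single attribute value or, for n > 1, a tuple of the values of the attributes indexed by s.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between paired observations xs and ys."""
    t = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), n in joint.items():
        # P(x, y) * log(P(x, y) / (P(x) P(y))), written in terms of counts.
        mi += (n / t) * math.log((n * t) / (px[x] * py[y]))
    return mi
```

Because only joint and marginal counts are required, the weights can be derived from the tables that AnDE already collects in its training pass.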


2.3  Study 

We implemented Averaged 1- and 2-Dependence Estimators (A1DE and A2DE) with both
subsumption resolution and mutual information weighting in the Weka machine learning workbench.

We compared the two levels of ANDE without either extension, with each in isolation
and with both in combination. We also included the important comparators naive Bayes
(NB) and Random Forest with 10 trees (RF10).

These  algorithms  were  all  applied  to  71  benchmark  datasets. 
2.4 Results

We  found  that 

•  Both weighting and subsumption resolution reduce the bias of both A1DE and
A2DE significantly more often than they increase it.

•  Jointly applying both weighting and subsumption resolution to either A1DE or
A2DE reduces bias significantly more often than it increases it relative to applying
either alone.

•  Both weighting and subsumption resolution increase the variance of both A1DE
and A2DE more often than they decrease it, although these results are not always
statistically significant.

•  Jointly applying both weighting and subsumption resolution to either A1DE or
A2DE increases variance more often than it decreases it relative to applying either
alone, but these differences are also not always statistically significant.

•  Random Forest has lower bias and higher variance significantly more often than
the reverse relative to all AnDE variants.

•  Subsumption resolution decreases error more often than not relative to both A1DE
and A2DE for both measures of error and for almost all of the different data
collections. The exceptions are A1DE, 0-1 loss, medium data and A2DE, 0-1 loss,
small data, for which there are draws. However, not all these results are statistically
significant.

•  Subsumption resolution with weighting can decrease both RMSE and 0-1 loss for
the first two collections (all and large data sets). As predicted, the effectiveness
reduces as data set sizes reduce: for medium data sets, subsumption resolution
with weighting can have slightly worse performance relative to weighting in terms
of 0-1 loss but better in terms of RMSE. The results, however, are non-significant.
The same pattern can be observed in smaller data sets, where subsumption resolution
with weighting is not very effective.

•  Subsumption resolution in tandem with weighting makes AnDE competitive
with RF10. On the all and small data set collections it wins significantly more often
than it loses on both error measures. On medium data sets, A2DE wins significantly
more often than it loses against RF10, while the advantage of A1DE over RF10 is not
significant. On large data sets, both A1DE and A2DE lose to RF10. These results are,
however, not significant. With five wins and seven losses against RF10, we conjecture that
AnDE with subsumption resolution and weighting, which has all the desirable properties
for learning from big data, is a strong contender for big data learning.

The average results of classification and learning time for all the compared techniques
are shown in Figure 2. One can see that subsumption resolution can greatly reduce
A2DE's classification time. While A2DE-S and A2DE-SW require only slightly less
training time on average than RF10, the training time complexity of AnDE and its
variants is linear with respect to data quantity while RF10's is super-linear, as shown by
the difference between training times for all data and for large data. The training time
advantage would substantially increase if RF10 were applied to data that were too large
to maintain in RAM. A2DE and its variants require substantially more classification time
than RF10, even with the decreases introduced by subsumption resolution. However,
it can be seen that the classification time of RF10 is also super-linear with respect to
training set size, whereas AnDE's is not. This is due to the size of the trees increasing
as the data quantity increases.

Figure 2: Averaged learning and classification timing results normalized with respect to NB.

More  detail  about  this  study  can  be  found  in  our  PAKDD  paper  [26]. 

2.5  Fast  and  effective  attribute  selection 

The  second  study  [5]  investigated  a  two-pass  approach  to  attribute  selection  for  AODE. 
Previous  research  [22,  23,  30]  has  shown  that  attribute  selection  can  be  very  effective 
at  reducing  the  bias  of  AODE.  However,  these  techniques  have  used  computationally 
expensive  wrapper  techniques  for  attribute  selection  that  are  infeasible  to  implement 
out-of-core.  To  make  attribute  selection  feasible  out-of-core  it  is  necessary  to  add  only  a 
small  number  of  passes  through  the  data.  The  approach  we  investigate  here  is  based  on 
the  observation  that  it  is  possible  to  nest  a  large  space  of  alternative  models  such  that 
each  is  a  trivial  extension  of  another.  Let  p  and  c  be  the  set  of  indices  of  parent  and  child 
attributes,  respectively.  For  every  attribute  Xi,  the  AODE  models  that  use  attributes  in 
p  as  parents  and  attributes  in  c  U  {i}  as  children  are  minor  extensions  of  a  model  that 
uses  attributes  in  p  as  parents  and  attributes  in  c  as  children.  The  same  is  true  of  models 
that  use  attributes  in  p  U  {i}  as  parents  and  attributes  in  c  as  children.  Importantly, 
multiple  models  that  build  upon  one  another  in  this  way  can  be  efficiently  evaluated  in 
a  single  set  of  computations.  Using  this  observation,  we  create  a  space  of  models  that 
are nested together, and then select the best model using leave-one-out cross validation
in a single extra pass through the training data.

The mutual information [13] between an attribute and the class measures how
informative this attribute is about the class, and thus it is a suitable metric to rank the
attributes. It can be computed using the information already gathered by AODE in its
first pass through the training data.

We  extend  AODE  by  adding  a  second  pass  through  the  training  data.  The  first  pass 
constructs  the  full  AODE  model.  The  attributes  are  then  ranked  in  descending  order 
of  mutual  information  with  the  class.  This  is  used  to  create  a  two-dimensional  space  of 
nested  models.  In  one  dimension  we  have  n  increasing  subsets  of  the  attributes  used  in 
the  parent  role.  In  the  other  dimension  we  have  n  increasing  subsets  of  the  attributes 
used  in  the  child  role.  Using  incremental  cross-validation  [10]  it  is  possible  to  efficiently 
perform  cross-validation  on  an  AODE  model  in  a  single  pass  through  the  data.  As 
the  models  being  assessed  are  nested,  they  can  be  evaluated  in  this  manner  in  a  single 
pass  using  very  little  more  computation  than  is  required  to  evaluate  the  full  model  in 
isolation. 

In this manner we perform discriminative evaluation of all n × n nested models in
a single computationally efficient pass. Any measure of classification performance could
potentially be employed. In this study we used root mean squared error (RMSE), which
is an effective measure of the calibration of the probability estimates produced by a
model.
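The idea of scoring many nested models in one sweep can be illustrated with a deliberately simplified sketch. Instead of AODE with separate parent and child dimensions, it uses naive Bayes models built from successive prefixes of a ranked attribute list, so that each model extends the previous one by a single factor. The function name, the smoothing, the in-memory data and the definition of RMSE as the root mean of (1 - P(true class))² are assumptions made for illustration. Leave-one-out is obtained by subtracting the held-out instance from the counts.

```python
import math
from collections import Counter


def loo_rmse_nested_nb(rows, labels, order):
    """Leave-one-out RMSE of every naive Bayes model using a prefix of `order`."""
    classes = sorted(set(labels))
    t = len(rows)
    class_n = Counter(labels)
    joint = Counter()  # (i, x_i, y) -> count
    n_values = [len({r[i] for r in rows}) for i in range(len(rows[0]))]
    for x, y in zip(rows, labels):  # first pass: collect counts
        for i, xi in enumerate(x):
            joint[(i, xi, y)] += 1
    sq_err = [0.0] * len(order)  # one accumulator per nested model
    for x, y in zip(rows, labels):  # extra pass: evaluate all nested models
        # Class counts with the current instance held out.
        held = {c: class_n[c] - (c == y) for c in classes}
        log_s = {c: math.log((held[c] + 1.0) / (t - 1 + len(classes)))
                 for c in classes}
        for m, i in enumerate(order):
            # Extending the previous model by attribute i costs one factor.
            for c in classes:
                n = joint[(i, x[i], c)] - (c == y)
                log_s[c] += math.log((n + 1.0) / (held[c] + n_values[i]))
            top = max(log_s.values())
            z = sum(math.exp(v - top) for v in log_s.values())
            p_true = math.exp(log_s[y] - top) / z
            sq_err[m] += (1.0 - p_true) ** 2
    return [math.sqrt(e / t) for e in sq_err]
```

The model with the lowest returned RMSE would be selected. Evaluating all prefixes costs little more than evaluating the largest model alone, which is the property the nested search relies on.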

In  an  extensive  study  we  compare  the  proposed  attribute  selective  AODE  (ASAODE) 
with  AODE,  mutual  information  weighted  AODE  (WAODE),  AODE  with  subsumption 
resolution  (AODESR),  BSE  selective  AODE  (BSEAODE)  and  A2DE  on  the  same  71 
benchmark  datasets  used  in  the  first  study. 

2.6  Results 

The  empirical  results  show  that  the  new  algorithm  is  significantly  more  accurate  than 
AODE,  WAODE  and  AODESR. 

It  has  comparable  error  to  BSEAODE,  while  requiring  substantially  less  computation 
and  being  executed  out-of-core. 

As  we  expected,  it  is  not  as  accurate  on  average  as  A2DE.  However,  in  continuing 
research  we  are  implementing  the  approach  with  A2DE  and  A3DE,  with  extremely 
promising  initial  results. 

Our  new  out-of-core  attribute  selection  algorithm  requires  significantly  less  training 
time  than  BSEAODE,  and  less  classification  time  than  AODE  and  all  other  variants, 
especially  than  A2DE. 

3  Scalable  discriminative  extensions  to  Bayesian  Network  Classifiers 

Standard Bayesian Network Classifiers can also be improved using our approach of
single-pass discriminative selection between a large class of nested models. As in the work with
ANDE, our motivation is to use efficient out-of-core discriminative learning to improve
models learned using efficient out-of-core generative learning with maximum likelihood
parameterization.

Bayesian Network Classifiers are defined by a parent relation $\pi$ and Conditional
Probability Tables (CPTs). $\pi$ encodes conditional independence relationships between the
attributes. CPTs encode the specific conditional probabilities.

A Bayesian Network Classifier classifies using

$$
P(y \mid x) \propto P(y \mid \pi_Y) \prod_{i=1}^{a} P(x_i \mid \pi_i)
\tag{3}
$$

The class variable Y is usually set to be a parent of all attributes $X_i$.

Given $\pi$, CPTs can be learned by counting joint frequencies in a single pass
through the data.

K-Dependence Bayes (KDB) [15] learns a restricted Bayesian Network Classifier in
two passes through the learning data. In the first pass it learns the structure. In the
second pass it learns the parameters of the CPTs.

To learn the structure it first collects counts for all pairs of attributes with the class.
It then orders the attributes based on mutual information with the class. Each attribute
is then assigned parents such that

•  each has no more than k parents;

•  an attribute's parents must be earlier in the order than the child;

•  within  these  constraints  the  parents  are  selected  that  maximize  mutual  information 
between  the  parent  and  child  conditioned  on  the  class. 
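The three constraints above can be sketched as a short procedure. This is an illustrative sketch only: the function name is hypothetical, and it assumes that the mutual information of each attribute with the class (`mi`) and the conditional mutual information between each pair of attributes given the class (`cmi`) have already been computed from the first-pass counts.

```python
def kdb_parents(mi, cmi, k):
    """Assign up to k attribute parents to each attribute, KDB style.

    mi[i]     -- mutual information between attribute i and the class
    cmi[i][j] -- mutual information between attributes i and j given the class
    Returns (order, parents), where parents[i] lists the parents of attribute i.
    """
    # Order attributes by decreasing mutual information with the class.
    order = sorted(range(len(mi)), key=lambda i: mi[i], reverse=True)
    parents = {}
    for pos, child in enumerate(order):
        earlier = order[:pos]  # only earlier attributes may be parents
        ranked = sorted(earlier, key=lambda j: cmi[child][j], reverse=True)
        parents[child] = ranked[:k]  # at most k parents
    return order, parents
```

The class is additionally a parent of every attribute, as in Equation 3; that is left implicit here.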

The parameter k controls the bias/variance trade-off. Low k has low variance but high
bias,  and  is  well  suited  to  small  quantities  of  data,  while  higher  k  has  higher  variance 
but  lower  bias  and  is  better  suited  to  larger  quantities  of  data.  This  is  illustrated  in 
Figure  3.  Further,  spurious  attributes  may  increase  error. 

KDB models are naturally nested. A model with k = v subsumes a model with
k = v - 1. A model using the first w attributes as children subsumes a model using the
first w - 1 attributes as children. Thus it can be seen that a KDB model with k = v
and w attributes subsumes v × w submodels. All of these submodels can be assessed in
a single out-of-core pass through the training data using incremental cross-validation.
Due to the nested nature of the models, this process takes little more computation than
incremental cross validation of the full model.

Our algorithm, Selective KDB, performs such evaluation in a third pass through the
data, resulting in a very efficient three-pass out-of-core algorithm.

Let a be the number of attributes; v be the average number of values; y be the
number of classes; $a^*$ be the number of attributes selected ($a^* \le a$); and $k^*$ be the best
value of k found ($k^* \le k_{max}$). Selective KDB's training space complexity is
$O(y a^2 v^2 + y a v^{k+1})$. Its classification space complexity is $O(y a^* v^{k^*+1})$. Its training time complexity
is $O(t a^2 + y a^2 v^2 + t a y k)$. Its classification time complexity is $O(y a^* k^*)$.

On large datasets our out-of-core algorithm has error that is very competitive with the
state-of-the-art in-core algorithm Random Forests [3] (RF) and with the state-of-the-art
out-of-core algorithm Vowpal Wabbit (VWLF) [11]. As illustrated in Figure 4, it has lower
error significantly more often than not relative to state-of-the-art Bayesian Network
Classifiers, including computationally expensive in-core approaches (BayesNet) and
out-of-core naive Bayes (NB), AODE and TAN. As illustrated in Figures 5 and 6, Selective
KDB is an order of magnitude more efficient in training than in-core Random Forests, and
two orders of magnitude more efficient than conventional Bayesian Network Classifiers.
It is also substantially more efficient than the state-of-the-art out-of-core logistic
regression algorithm Vowpal Wabbit under either of its two main objective functions,
squared loss (VWSF) or logistic loss (VWLF).

Figure 3: Learning curves for KDB classifiers with varying k on the benchmark poker-hand
dataset. Lower values of k (naive Bayes is KDB with k = 0) have low variance and
have low error on small data, but their high bias means their error asymptotes quickly,
while the lower bias of KDB with higher k leads to lower error on large quantities of
data.

A  paper  presenting  these  results  in  greater  detail  has  been  submitted  for  journal 
publication. 
Figure 4: The RMSE rank of alternative algorithms over 16 large datasets. Out-of-core
Selective KDB achieves lower error more often than any alternative, including
state-of-the-art in-core Random Forests, and significantly more often than any of the Bayesian
Network Classifiers. (Horizontal axis: learners.)

Figure 5: The relative training times of in-core algorithms BayesNet and Random Forests
compared with out-of-core naive Bayes and Selective KDB.

Figure 6: Training and classification time comparisons for the out-of-core classifiers.


4  WANBIA 


A number of researchers have previously investigated adding weights to naive Bayes
models [8, 6, 27, 7, 21]. Naive Bayes classifies using

$$
\hat{P}_{NB}(y, x) \propto \hat{P}(y) \prod_{i=1}^{a} \hat{P}(x_i \mid y).
\tag{4}
$$

There are three main variants of weighted naive Bayes. In the most general case,
the weight depends on the attribute value and class:

$$
\hat{P}(y, x) \propto \hat{P}(y)^{w_y} \prod_{i=1}^{a} \hat{P}(x_i \mid y)^{w_{i, x_i, y}}.
\tag{5}
$$

Doing this results in $\sum_i |X_i|$ weight parameters. A second possibility is to give a single
weight per attribute:

$$
\hat{P}(y, x) \propto \hat{P}(y)^{w_y} \prod_{i=1}^{a} \hat{P}(x_i \mid y)^{w_i}.
\tag{6}
$$

One final possibility is to set all weights to a single value:

$$
\hat{P}(y, x) \propto \hat{P}(y)^{w_y} \prod_{i=1}^{a} \hat{P}(x_i \mid y)^{w}.
\tag{7}
$$

Equation 6 is a special case of Equation 5, where $w_{i,j,y} = w_i$, and Equation 7 is a
special case of Equation 6 where $\forall_i\, w_i = w$.
Most previous research into weighting naive Bayes has used the weights to give
greater emphasis to more informative attributes. We argue that this is suboptimal, as
Bayesian approaches automatically take account of the amount of information about a
class that each attribute contributes. Indeed, naive Bayes is a Bayes optimal classifier
except insofar as the base probability estimates are incorrect and the assumption
that the attributes are conditionally independent is violated. We argue that it is in this
latter respect that weighting has the most to offer and propose to address it through
discriminative learning.

Our initial algorithm, WANBIA, learns weights of the form in Equation 6. A second
algorithm, WANBIA-C, learns weights of the form in Equation 5.

The  approach  is  inspired  by  MAPLMG  [4],  a  system  that  learns  weights  on  the 
sub-models  in  AODE  by  maximizing  Conditional  Log-Likelihood. 

To this end we have derived the gradient of the Conditional Log-Likelihood with respect
to the weights in Equations 5 and 6. This allows the use of a regular gradient descent
optimization procedure.
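As a hedged illustration of this kind of optimization, the sketch below learns one weight per attribute (the form of Equation 6) by plain gradient ascent on the conditional log-likelihood. It is not the published WANBIA implementation: the function names, the fixed class-prior weight of 1, the fixed step size and the precomputed log-probability inputs are all assumptions. For each weight $w_i$ the gradient used is the sum over instances of $\log \hat{P}(x_i \mid y_{true}) - \sum_y \hat{P}(y \mid x) \log \hat{P}(x_i \mid y)$.

```python
import math


def posterior(log_prior, log_lik, w):
    """Class posterior for one instance.

    log_prior[c]  -- log P(c)
    log_lik[c][i] -- log P(x_i | c) for this instance
    w[i]          -- weight of attribute i
    """
    scores = [lp + sum(wi * l for wi, l in zip(w, ll))
              for lp, ll in zip(log_prior, log_lik)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cll_and_gradient(data, log_prior, w):
    """Conditional log-likelihood and its gradient with respect to w.

    data is a list of (log_lik, true_class_index) pairs.
    """
    cll = 0.0
    grad = [0.0] * len(w)
    for log_lik, y in data:
        p = posterior(log_prior, log_lik, w)
        cll += math.log(p[y])
        for i in range(len(w)):
            expected = sum(p[c] * log_lik[c][i] for c in range(len(p)))
            grad[i] += log_lik[y][i] - expected
    return cll, grad


def fit_weights(data, log_prior, n_attrs, rate=0.1, steps=200):
    """Gradient ascent on the CLL, starting from plain naive Bayes (w = 1)."""
    w = [1.0] * n_attrs
    for _ in range(steps):
        _, grad = cll_and_gradient(data, log_prior, w)
        w = [wi + rate * g for wi, g in zip(w, grad)]
    return w
```

Starting from all weights equal to one means the first iterate is exactly the generatively learned naive Bayes model, which is the sense in which the maximum likelihood estimates give the discriminative search a good starting point.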

WANBIA-C  learns  a  model  that  maps  directly  onto  the  standard  logistic  regression 
model.  By  optimizing  for  the  same  objective  function  as  used  by  logistic  regression  we 
ensure  that  the  models  learned  will  be  equivalent,  except  insofar  as  the  optimization 
process  fails  to  find  the  true  optima. 

We  show  that  WANBIA  substantially  reduces  the  bias  of  naive  Bayes  at  the  cost  of 
a  modest  increase  in  variance.  For  any  but  extremely  small  datasets  it  tends  to  produce 
lower  error  models  than  naive  Bayes.  The  WANBIA  models  are  more  biased  than  logistic 
regression,  but  with  fewer  free  parameters  have  lower  variance.  As  a  result  they  have 
very  competitive  error  for  small  to  medium  sized  datasets.  With  fewer  free  parameters 
the  optimization  process  is  also  considerably  more  efficient. 

As  shown  in  Figure  7,  when  using  unconstrained  optimization  WANBIA-C  has  similar 
RMSE  and  0-1  loss  to  logistic  regression  but  the  models  converge  much  faster  due  to  the 
naive  Bayes  probability  estimates  acting  as  an  effective  preconditioner. 

Of  particular  significance  for  stream  learning,  as  shown  in  Figure  8,  WANBIA-C 
substantially  outperforms  logistic  regression  when  optimized  using  stochastic  gradient 
descent. 

Full details can be found in our Journal of Machine Learning Research and IEEE
International Conference on Data Mining papers [25, 24].

5  WANJE 

One  limitation  of  WANBIA  and  WANBIA-C  is  that  they  use  linear  models  that  cannot 
directly  capture  high-order  multivariate  interactions  in  data. 

Our approach of using maximum likelihood parameterization to speed up discriminative
parameter optimization using conditional log-likelihood can also be generalized
to other forms of Bayesian Network Classifier that directly model higher-order
interactions. However, to make the optimization feasible it is important that it form a convex
optimization problem, and only moral Bayesian networks do so [14].
Figure 7: Comparison of RMSE (left) and number of iterations (right) of WANBIA-C and
LR on 73 datasets using quasi-Newton optimization. LR parameters are initialized to NB MAP
estimates. Numbers of iterations are on a log scale.

The ANDE classifiers form moral networks. However, each combination of n + 1
attributes and class requires

    n + 1
      n                (8)

parameters to model, where k is the number of classes, n is the number of parents per
sub-model (the n from ANDE), and v is the average number of values per attribute.
This is because for every combination of n parents and class we need to model the CPT
for the child attribute.

This  results  in  a  very  large  number  of  parameters  and  a  very  difficult  optimization 
task  for  high  values  of  n. 

To address this issue we have developed a variant of ANDE that uses a single
parameter for each combination of attribute and class values.

It is possible to partition the attributes $X$ into any set of sets of attributes
$\mathcal{P}$ such that $\bigcup_{\alpha \in \mathcal{P}} \alpha = X$ and, for all
$\alpha, \beta \in \mathcal{P}$ with $\alpha \neq \beta$, $\alpha \cap \beta = \emptyset$.
By assuming independence only between the different sets of attributes one then obtains

$$P(x \mid y) = \prod_{\alpha \in \mathcal{P}} P(\alpha \mid y). \qquad (9)$$

For example, if there are four attributes $x_1$, $x_2$, $x_3$ and $x_4$ that are
partitioned into the sets $\{x_1, x_2\}$ and $\{x_3, x_4\}$, then by assuming conditional
independence between the sets we obtain

$$P(x_1, x_2, x_3, x_4 \mid y) = P(x_1, x_2 \mid y)\, P(x_3, x_4 \mid y). \qquad (10)$$

The ANJE model is equivalent to the geometric mean of all such models in which all
subsets are of size $N$:

$$P(x \mid y) = \prod_{\alpha \in \binom{X}{N}} P(\alpha \mid y)^{(N-1)!\,(A-N)!/(A-1)!}, \qquad (11)$$

where $A = |X|$ and $\binom{X}{N}$ is the set of sets of attributes of size $N$,
$\{\alpha \mid \alpha \subseteq X, |\alpha| = N\}$.

Figure 8: RMSE comparison of LR and WANBIA-C with SGD on Localization,
Census-income, MITFaceSetB, MSDYearPrediction, Covtype, MITFaceSetC, USCensus1990,
and Activity datasets.

Equation 11 is derived as follows. Let $S = |\mathcal{P}|$, the number of sets of
attributes in each model of the form (9). As each model partitions the $A$ attributes
into sets of size $N$, $S = A/N$. Let $T$ be the total number of these sets,
$T = |\binom{X}{N}| = \binom{A}{N}$. Let $M$ be the total number of models of the form (9)
of which (11) represents the geometric mean. This differs depending on whether one allows
different permutations or only combinations of the different sets of attributes. It turns
out that this value cancels out, so we do not need to specify what it is. Let $\chi$ be
the number of models in which each set of attributes appears. As this is identical for
all sets of attributes, and as the total number of terms in all models is $MS$, this
equals $MS/T$. Let $\alpha_{i,j}$ be the $j$th set of attributes in the $i$th model. The
geometric mean of all models is

$$P(x \mid y) = \left( \prod_{i=1}^{M} \prod_{j=1}^{S} P(\alpha_{i,j} \mid y) \right)^{1/M} \qquad (12)$$

$$= \prod_{i=1}^{M} \prod_{j=1}^{S} P(\alpha_{i,j} \mid y)^{1/M} \qquad (13)$$

$$= \prod_{\alpha \in \binom{X}{N}} P(\alpha \mid y)^{\chi/M}. \qquad (14)$$

The final exponent $\chi/M$ can be expanded as follows:

$$\frac{\chi}{M} = \frac{MS/T}{M} = \frac{S}{T} = \frac{A/N}{\binom{A}{N}} = \frac{A}{N} \cdot \frac{N!\,(A-N)!}{A!} = \frac{(N-1)!\,(A-N)!}{(A-1)!}. \qquad (15)\text{--}(19)$$

Substituting (19) for the exponent in (14) we get (11). We can then classify using

$$P(y \mid x) \propto P(y) \prod_{\alpha \in \binom{X}{N}} P(\alpha \mid y)^{(N-1)!\,(A-N)!/(A-1)!}. \qquad (20)$$

An ANJE model with a given $N$ models the same order of interactions between
attributes as an ANDE model with $n = N - 1$, because $n$ refers to the number of parent
attributes for each child in an ANDE model while $N$ refers to the size of a clique of
attributes in an ANJE model.

Our experiments indicate that ANJE has substantially higher bias than ANDE where
$n = N - 1$, and that the resulting reduction in variance is rarely sufficient to lead
to lower error. However, the lower number of parameters required by the ANJE models
can make it feasible to model higher-order interactions than can be modeled by ANDE. It
is our belief that, for large data, modeling higher-order interactions will usually be
more effective at reducing bias than more detailed modeling of the lower-order
interactions.

ANJE serves as an effective basis for discriminative weighting of the form used in
WANBIA and WANBIA-C:

$$P_{\mathrm{WANJE}_w}(y \mid x) \propto P(y)^{w_y} \prod_{\alpha \in \binom{X}{N}} P(\alpha \mid y)^{w_{\alpha,y}}. \qquad (21)$$
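To make the role of the weights concrete, the following sketch scores classes in log space using the weighted product form of Equation 21. It is a minimal illustration under stated assumptions, not the implementation used in this work: the function `wanje_posterior` and its argument layout are hypothetical, the probability estimates are assumed to be supplied by the caller already in log form, and the weights are taken as given rather than learned.

```python
import math


def wanje_posterior(log_prior, log_lik, w_prior, w_lik):
    """Normalized class posterior from a weighted product of estimates.

    log_prior[y]  : log P(y)
    log_lik[y]    : list of log P(alpha | y), one per attribute subset alpha
    w_prior[y]    : weight on the class prior
    w_lik[y]      : list of weights, aligned with log_lik[y]
    """
    scores = {}
    for y in log_prior:
        score = w_prior[y] * log_prior[y]
        for weight, ll in zip(w_lik[y], log_lik[y]):
            score += weight * ll
        scores[y] = score
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    unnorm = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(unnorm.values())
    return {y: u / total for y, u in unnorm.items()}


log_prior = {"a": math.log(0.5), "b": math.log(0.5)}
log_lik = {"a": [math.log(0.8)], "b": [math.log(0.2)]}
unit = {"a": [1.0], "b": [1.0]}
posterior = wanje_posterior(log_prior, log_lik, {"a": 1.0, "b": 1.0}, unit)
```

Setting every likelihood weight to the exponent of Equation 11 and every prior weight to 1 gives the unweighted ANJE classifier, while a weight of zero removes the corresponding term from the product.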

Our experiments show that the resulting models are very competitive with the
state-of-the-art learning algorithm Random Forest. Of particular significance, we have
developed an out-of-core single-pass variant, where the maximum likelihood estimates are
updated incrementally and the weights are learned using stochastic gradient descent.
As illustrated in Figure 9, our experiments show that for large datasets this single-pass
out-of-core incremental learner is very competitive with the state-of-the-art in-core
batch learner Random Forests.

Figure 9: Learning curves for single-pass out-of-core incremental WANJE and multiple-
pass in-core batch Random Forests on the Poker-Hand dataset.

We  are  in  the  process  of  writing  a  research  paper  on  this  work. 


6  Incremental  discretization 

The ANDE, WANBIA and WANJE algorithms require that numeric data be discretized.
While investigating the potential to extend our approaches to streaming data we
were confronted with the problem of discretizing such data. This is on the face of it
problematic, because streaming data require cutpoints that vary over time, while many
algorithms require the meaning of a cutpoint to remain invariant. In this line of research
we investigated this issue and identified a way to allow cutpoints to drift while
maintaining invariant meaning.

This can be achieved by performing quantile-based discretization, where each
cutpoint is set at a specific quantile in the current data distribution. This way the
meaning of the cutpoint is invariant even while the value changes. For example, the
cutpoint for an interval representing high income should grow over time as inflation
increases incomes. If it is set at the 80th percentile, so that the interval covers the
top 20% of incomes, then it will do so.
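The income example can be illustrated with a few lines of code. This is a sketch on synthetic numbers, not data from our experiments. The helper names `quantile_cutpoint` and `share_above` are hypothetical, and the nearest-rank rule is just one of several common quantile definitions.

```python
import math


def quantile_cutpoint(values, percent):
    """Value at the given percentile, using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def share_above(values, cutpoint):
    """Fraction of values strictly above the cutpoint."""
    return sum(v > cutpoint for v in values) / len(values)


# Synthetic incomes, then the same incomes after 10% inflation.
incomes = [20_000 + 1_000 * i for i in range(100)]
inflated = [v * 11 // 10 for v in incomes]

cut_before = quantile_cutpoint(incomes, 80)
cut_after = quantile_cutpoint(inflated, 80)
```

The cutpoint value rises with inflation, yet the interval above it still holds the same share of the population, which is the sense in which its meaning stays invariant.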

We developed an algorithm, IDA, for efficient quantile-based discretization of
streaming data. It operates by maintaining a sample of the data encountered in the stream
because 1) it is not feasible for high-throughput streams to maintain a complete record
of all values observed to date; 2) it is computationally efficient; and 3) it is possible to
place tight bounds on the expected variance of the cutpoints.

We  used  the  reservoir  sampling  algorithm  [17]  to  maintain  a  random  sample  for  each 
attribute. 
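For readers unfamiliar with reservoir sampling, a minimal sketch of the classic single-item variant (often called Algorithm R) follows. It stores the sample in a plain list rather than in the interval-heap structure used by IDA, and the class name `Reservoir` and its interface are hypothetical.

```python
import random


class Reservoir:
    """Uniform random sample of fixed size over a stream of unknown length."""

    def __init__(self, size, seed=None):
        self.size = size
        self.sample = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, value):
        """Offer one stream value; return True if the sample changed."""
        self.seen += 1
        if len(self.sample) < self.size:
            # Still filling the reservoir: always keep the value.
            self.sample.append(value)
            return True
        # Keep the new value with probability size / seen by drawing a
        # slot index over everything seen so far.
        slot = self._rng.randrange(self.seen)
        if slot < self.size:
            self.sample[slot] = value
            return True
        return False


reservoir = Reservoir(size=50, seed=0)
changes = sum(reservoir.add(v) for v in range(1000))
```

Once the reservoir is full, the value at time step $t$ displaces a stored value with probability $s/t$, so changes to the sample become steadily rarer as the stream grows.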

We developed an efficient data structure and algorithms for storing and updating the
sample using a vector of interval heaps [16]. This data structure ensures that insertion
and deletion are of order $O(\log s)$, where $s$ is the sample size, and that retrieving
a cutpoint takes constant time.

We  also  developed  a  variant  of  IDA,  IDAW,  that  tracks  the  current  distribution  by 
using  a  window  of  recent  examples  in  place  of  a  random  sample  of  all  examples.  This  is 
computationally  more  intensive,  as  it  requires  an  update  at  every  time  step. 

The  computational  complexity  of  IDA  is  dominated  by  the  costs  of  maintaining  the 
samples  and  determining  the  quantiles  from  those  samples.  The  required  operations  are 
to  insert  a  new  value  (only  required  while  the  sample  is  not  yet  at  full  size),  to  replace 
a  random  value  with  a  new  value,  and  to  return  the  required  quantiles. 

As each bin is maintained as an interval heap [16], finding the quantiles takes constant
time and inserting or removing a value from a bin $V_i$ takes
$O(\log |V_i|) = O(\log(s/m))$ time, where $m$ is the number of bins. As replacement
requires up to $m$ insertions and deletions, replacement requires $O(m \log(s/m))$ time.

However, these relatively expensive updates become increasingly rare: at time step $t$
(the size of the stream to date) a replacement is required only with probability $s/t$,
that is, on average once every $t/s$ time steps. Thus the amortized cost is

$$O\left( \left[ \sum_{i=1}^{s} m \log(i/m) + \sum_{i=s+1}^{t} \frac{s}{i}\, m \log(s/m) \right] / t \right),$$

where the first term represents the initial $s$ time steps during which the sample is
built up to its operating size and the second term represents updates to the sample once
it reaches operating size. It is readily apparent that these updates rapidly become very
rare and that as the size of the stream becomes very large the amortized cost becomes
negligibly small.
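A quick calculation shows how rare these updates become. The sketch below assumes the standard reservoir sampling property that the value arriving at step $i > s$ enters the sample with probability $s/i$. The function name `expected_replacements` and the sizes used are illustrative choices, not settings from our experiments.

```python
import math


def expected_replacements(s, t):
    """Expected number of sample replacements between steps s+1 and t."""
    return sum(s / i for i in range(s + 1, t + 1))


s, t = 1_000, 1_000_000
expected = expected_replacements(s, t)

# The harmonic sum is closely approximated by s * ln(t / s).
approximation = s * math.log(t / s)

# Fraction of all time steps that trigger a replacement.
update_rate = expected / t
```

With a sample of 1,000 values and a stream of one million values, fewer than 1% of time steps trigger a replacement, and the expected count grows only logarithmically in $t$.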

The situation is more complex for IDAW, which maintains a window of the $s$ most
recent values for each attribute. This requires that the values be maintained in both
time and value order. Maintaining an order by time can be achieved very efficiently
with a circular buffer, which supports all updates and accesses in constant time. As the
elements to be replaced in a replacement operation are no longer selected at random, it
is not efficient to maintain the bins as interval heaps, as above. Rather, we need to use
slightly more expensive balanced binary trees, for which the time to identify the location
of the value to be removed is $O(\log(s/m))$. This does not increase the overall
complexity of the update operation relative to that for IDA. The major computational
penalty, however, is that these updates must be performed for every object encountered
in the stream, which makes the maintenance of the discretization a non-trivial ongoing
overhead.
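The two orderings that IDAW has to maintain can be sketched as follows. This is a simplified illustration rather than the IDAW data structure itself: a single sorted Python list stands in for the per-bin balanced binary trees, so locating a value costs $O(\log s)$ but inserting or deleting it costs $O(s)$ because list elements are shifted. The class name `WindowQuantiles` and its methods are hypothetical.

```python
from bisect import bisect_left, insort
from collections import deque


class WindowQuantiles:
    """Quantile cutpoints over the most recent `size` values of a stream."""

    def __init__(self, size):
        self.size = size
        self._by_time = deque()   # arrival order, oldest value first
        self._by_value = []       # the same values, kept sorted

    def add(self, value):
        """Add a value, evicting the oldest one once the window is full."""
        if len(self._by_time) == self.size:
            oldest = self._by_time.popleft()
            # Binary search locates the expired value in the sorted view.
            del self._by_value[bisect_left(self._by_value, oldest)]
        self._by_time.append(value)
        insort(self._by_value, value)

    def cutpoint(self, percent):
        """Value at the given percentile of the window (nearest rank)."""
        n = len(self._by_value)
        rank = max(1, -(-percent * n // 100))  # ceiling division
        return self._by_value[rank - 1]


window = WindowQuantiles(size=5)
for v in range(1, 11):
    window.add(v)
```

Unlike the reservoir used by IDA, the window changes on every arrival, which is the source of the ongoing overhead described above.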

6.1  Evaluation 

We conducted a series of studies on 5 benchmark stream classification datasets,
performing classification with Logistic Regression learned with Stochastic Gradient
Descent, as this is an effective stream classification learning algorithm.

In the absence of concept drift the use of sampling was demonstrated to result in
only negligible loss in accuracy.

On data with concept drift IDA has only negligible loss in accuracy relative to
maintaining a complete quantile-based discretization over the entire stream encountered
to date. This computationally intensive comparator approach would not be feasible in
practice.

The  IDAW  approach  delivered  very  substantial  reductions  in  error  for  two  data 
streams  and  substantial  increases  for  two  more.  The  reductions  demonstrate  that  for 
appropriate  forms  of  data  maintaining  discretizations  based  on  quantiles  as  they  vary 
over  time  can  maintain  relevant  meaning  while  the  cutpoints  vary.  However,  the  poor 
results  on  some  other  data  streams  demonstrate  that  this  is  not  always  appropriate. 

Both  IDA  and  IDAW  consistently  and  substantially  reduced  error  relative  to  Logistic 
Regression  on  the  undiscretized  numeric  data. 

Full details can be found in our ICDM 2014 paper [20].